Building a Large-Scale Annotated Chinese Corpus

Authors

  • Nianwen Xue
  • Fu-Dong Chiou
  • Martha Palmer
Abstract

In this paper we address issues related to building a large-scale Chinese corpus. We try to answer four questions: (i) how to speed up annotation, (ii) how to maintain high annotation quality, (iii) for what purposes is the corpus applicable, and finally (iv) what future work we anticipate.

Similar resources

Building an Annotated Japanese-Chinese Parallel Corpus - A Part of NICT Multilingual Corpora

We are constructing a Japanese-Chinese parallel corpus, which is a part of the NICT Multilingual Corpora. The corpus is general domain, of large scale of about 40,000 sentence pairs, long sentences, annotated with detailed information and high quality. To the best of our knowledge, this will be the first annotated Japanese-Chinese parallel corpus in the world. We created the corpus by selecting ...

How Should A Large Corpus Be Built? - A Comparative Study Of Closure In Annotated Newspaper Corpora From Two Chinese Sources, Towards Building A Larger Representative Corpus Merged From Representative Sublanguage Collections

This study measures comparative lexical and syntactic closure rates in annotated Chinese newspaper corpora from the Academia Sinica Balanced Corpus and the University of Pennsylvania's Chinese Treebank. It then draws inferences as to how large such corpora need be to be representative models of subject-matter-constrained language domains within the same genre. Future large corpora should be bui...

Building Large Chinese Corpus for Spoken Dialogue Research in Specific Domains

A corpus is a valuable resource for information retrieval and data-driven natural language processing systems, especially for spoken dialogue research in specific domains. However, there are few non-English corpora, particularly in Chinese. Spoken by the nation with the largest population in the world, Chinese has become increasingly prevalent and popular among millions of people worldwide. ...

Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with Respect to Dependency Measures

Parsing is one of the important processes for natural language processing and, in general, a large-scale CFG is used to parse a wide variety of sentences. For many languages, a CFG is derived from a large-scale syntactically annotated corpus, and many parsing algorithms using CFGs have been proposed. However, we could not apply them to Japanese since a Japanese syntactically annotated corpus ha...

Publication date: 2002